Project - Unsupervised Learning

Author: Bright Kyeremeh (MrBriit)


Data Description:

The data contains features extracted from the silhouettes of vehicles viewed from
different angles. Four "Corgi" model vehicles were used for the experiment: a
double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This
particular combination of vehicles was chosen with the expectation that the
bus, the van and either one of the cars would be readily distinguishable, but
that it would be more difficult to distinguish between the two cars.

Domain:

Object recognition

Context:

The purpose is to classify a given silhouette as one of three types of vehicle,
using a set of features extracted from the silhouette. The vehicle may be viewed
from one of many different angles.

Attribute Information:

● All the features are geometric features extracted from the silhouette.

● All are numeric in nature.

Learning Outcomes:

● Exploratory Data Analysis

● Reduce the number of dimensions in the dataset with minimal information loss

● Train a model using Principal Components

Objective:

Apply a dimensionality reduction technique (PCA) and train a model using
principal components instead of training the model on just the raw data.


NB: In this notebook, we import libraries as and when we need them.

In [1]:
import pandas as pd #loading our dataset
df= pd.read_csv("vehicle-1.csv")
In [2]:
df.head() # view the first 5 rows
Out[2]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus


Data pre-processing

In [3]:
df.dtypes #checking the data types of each column
Out[3]:
compactness                      int64
circularity                    float64
distance_circularity           float64
radius_ratio                   float64
pr.axis_aspect_ratio           float64
max.length_aspect_ratio          int64
scatter_ratio                  float64
elongatedness                  float64
pr.axis_rectangularity         float64
max.length_rectangularity        int64
scaled_variance                float64
scaled_variance.1              float64
scaled_radius_of_gyration      float64
scaled_radius_of_gyration.1    float64
skewness_about                 float64
skewness_about.1               float64
skewness_about.2               float64
hollows_ratio                    int64
class                           object
dtype: object
In [4]:
df.shape #the shape of the dataset
Out[4]:
(846, 19)
In [5]:
df.describe().T #summary statistics of the numerical attributes
Out[5]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.0 119.0
circularity 841.0 44.828775 6.152172 33.0 40.00 44.0 49.0 59.0
distance_circularity 842.0 82.110451 15.778292 40.0 70.00 80.0 98.0 112.0
radius_ratio 840.0 168.888095 33.520198 104.0 141.00 167.0 195.0 333.0
pr.axis_aspect_ratio 844.0 61.678910 7.891463 47.0 57.00 61.0 65.0 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.0 55.0
scatter_ratio 845.0 168.901775 33.214848 112.0 147.00 157.0 198.0 265.0
elongatedness 845.0 40.933728 7.816186 26.0 33.00 43.0 46.0 61.0
pr.axis_rectangularity 843.0 20.582444 2.592933 17.0 19.00 20.0 23.0 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.0 188.0
scaled_variance 843.0 188.631079 31.411004 130.0 167.00 179.0 217.0 320.0
scaled_variance.1 844.0 439.494076 176.666903 184.0 318.00 363.5 587.0 1018.0
scaled_radius_of_gyration 844.0 174.709716 32.584808 109.0 149.00 173.5 198.0 268.0
scaled_radius_of_gyration.1 842.0 72.447743 7.486190 59.0 67.00 71.5 75.0 135.0
skewness_about 840.0 6.364286 4.920649 0.0 2.00 6.0 9.0 22.0
skewness_about.1 845.0 12.602367 8.936081 0.0 5.00 11.0 19.0 41.0
skewness_about.2 845.0 188.919527 6.155809 176.0 184.00 188.0 193.0 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.0 211.0
In [6]:
df.isna().sum() #checking for null values
Out[6]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64
In [7]:
#Defining dependent variables and the independent variables

#creating a copy in order to compare the two datasets (with and without missing values)
#newdf = df.copy()

X = df.iloc[:,0:18] #selecting the numerical attributes 
y = df.iloc[:,18] #selecting class attribute. 


Fill missing values with the median of each column

In [8]:
from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(missing_values=np.nan, strategy='median')

transformed_values = imputer.fit_transform(X)
column = X.columns
df1 = pd.DataFrame(transformed_values, columns = column )
In [9]:
df1.isnull().sum()
Out[9]:
compactness                    0
circularity                    0
distance_circularity           0
radius_ratio                   0
pr.axis_aspect_ratio           0
max.length_aspect_ratio        0
scatter_ratio                  0
elongatedness                  0
pr.axis_rectangularity         0
max.length_rectangularity      0
scaled_variance                0
scaled_variance.1              0
scaled_radius_of_gyration      0
scaled_radius_of_gyration.1    0
skewness_about                 0
skewness_about.1               0
skewness_about.2               0
hollows_ratio                  0
dtype: int64
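The same median fill can be done with pandas alone. A minimal sketch on a hypothetical toy frame (invented values, not the vehicle data), assuming `fillna` with column medians is an acceptable substitute for `SimpleImputer`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for X (not the vehicle data).
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]})

# Equivalent of SimpleImputer(strategy='median'): fill each column's
# missing entries with that column's own median.
filled = toy.fillna(toy.median(numeric_only=True))
```

This avoids the DataFrame-to-array round trip that `fit_transform` requires, at the cost of not being reusable inside an sklearn pipeline.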


Understanding the attributes

In [85]:
#Distribution of the independent variables

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-whitegrid')

df1.hist(bins=20, figsize=(60,40), color='lightblue', edgecolor = 'red')
plt.show()

From the plot above, it can be seen that most of the attributes are approximately normally distributed, with a few skewed to the right or left.

In [11]:
#boxplot distribution of the independent variables

plt.figure(figsize= (30,20))
sns.boxplot(data=df1,orient="h")
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a17579fd0>

scaled_variance.1 dominates the plot because it is measured on a much wider scale than the other attributes. We will therefore drop it and visualise the rest of the dataset on a comparable scale.

In [12]:
#boxplot distribution of the independent variables without scaled_variance.1

plt.figure(figsize= (30,20))
sns.boxplot(data=df1.drop('scaled_variance.1',axis=1),orient="h")
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a16f24358>

Now we can have a better look at the data and observe that many of the attributes contain outliers.

In [13]:
#Distribution of variables with most outliers

plt.figure(figsize= (20,15))
plt.subplot(3,3,1)
sns.boxplot(x= df1['max.length_aspect_ratio'], color='green')

plt.subplot(3,3,2)
sns.boxplot(x= df1['scaled_radius_of_gyration.1'], color='lightblue')

plt.show()

As can be seen, a lot of outliers are present in max.length_aspect_ratio and scaled_radius_of_gyration.1.

Dealing with the outliers

In [14]:

Q1 = df1.quantile(0.25)
Q3 = df1.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
compactness                     13.00
circularity                      9.00
distance_circularity            28.00
radius_ratio                    54.00
pr.axis_aspect_ratio             8.00
max.length_aspect_ratio          3.00
scatter_ratio                   51.00
elongatedness                   13.00
pr.axis_rectangularity           4.00
max.length_rectangularity       22.00
scaled_variance                 50.00
scaled_variance.1              268.50
scaled_radius_of_gyration       49.00
scaled_radius_of_gyration.1      8.00
skewness_about                   7.00
skewness_about.1                14.00
skewness_about.2                 9.00
hollows_ratio                   10.75
dtype: float64
In [15]:
df2 = df1[~((df1 < (Q1 - 1.5 * IQR)) |(df1 > (Q3 + 1.5 * IQR))).any(axis=1)]
In [16]:
df2.shape
Out[16]:
(813, 18)
In [17]:
df1.shape
Out[17]:
(846, 18)
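The filter above drops 33 rows. The same 1.5 * IQR fence can be illustrated on a hypothetical toy frame (values invented for the example): any row with a value outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] in any column is removed.

```python
import pandas as pd

# Toy frame with one obvious outlier in column x (hypothetical values).
toy = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [5, 6, 7, 8, 9]})

q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1

# True for any row that falls outside the fence in at least one column.
mask = ((toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))).any(axis=1)
trimmed = toy[~mask]
```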

Visualising the data without outliers

In [18]:
plt.figure(figsize= (20,15))
plt.subplot(8,8,1)
sns.boxplot(x= df2['pr.axis_aspect_ratio'], color='orange')

plt.subplot(8,8,2)
sns.boxplot(x= df2['skewness_about'], color='purple')

plt.subplot(8,8,3)
sns.boxplot(x= df2['scaled_variance'], color='brown')

plt.subplot(8,8,4)
sns.boxplot(x= df2['radius_ratio'], color='red')

plt.subplot(8,8,5)
sns.boxplot(x= df2['scaled_radius_of_gyration.1'], color='lightblue')

plt.subplot(8,8,6)
sns.boxplot(x= df2['scaled_variance.1'], color='yellow')

plt.subplot(8,8,7)
sns.boxplot(x= df2['max.length_aspect_ratio'], color='lightblue')

plt.subplot(8,8,8)
sns.boxplot(x= df2['skewness_about.1'], color='pink')

plt.show()

We can clearly see that the outliers have been removed. We could have skipped this step if there were too many outliers to drop or if our dataset were much larger.

In [19]:
#Counts of our dependent variable

print(y.value_counts())

#splitscaledf = df1.copy()
sns.countplot(y)
plt.show()
car    429
bus    218
van    199
Name: class, dtype: int64

Correlation between the independent attributes

In [20]:
df1.corr()
Out[20]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
compactness 1.000000 0.684887 0.789928 0.689743 0.091534 0.148249 0.812620 -0.788750 0.813694 0.676143 0.762070 0.814012 0.585243 -0.249593 0.236078 0.157015 0.298537 0.365552
circularity 0.684887 1.000000 0.792320 0.620912 0.153778 0.251467 0.847938 -0.821472 0.843400 0.961318 0.796306 0.835946 0.925816 0.051946 0.144198 -0.011439 -0.104426 0.046351
distance_circularity 0.789928 0.792320 1.000000 0.767035 0.158456 0.264686 0.905076 -0.911307 0.893025 0.774527 0.861519 0.886017 0.705771 -0.225944 0.113924 0.265547 0.146098 0.332732
radius_ratio 0.689743 0.620912 0.767035 1.000000 0.663447 0.450052 0.734429 -0.789481 0.708385 0.568949 0.793415 0.718436 0.536372 -0.180397 0.048713 0.173741 0.382214 0.471309
pr.axis_aspect_ratio 0.091534 0.153778 0.158456 0.663447 1.000000 0.648724 0.103732 -0.183035 0.079604 0.126909 0.272910 0.089189 0.121971 0.152950 -0.058371 -0.031976 0.239886 0.267725
max.length_aspect_ratio 0.148249 0.251467 0.264686 0.450052 0.648724 1.000000 0.166191 -0.180140 0.161502 0.305943 0.318957 0.143253 0.189743 0.295735 0.015599 0.043422 -0.026081 0.143919
scatter_ratio 0.812620 0.847938 0.905076 0.734429 0.103732 0.166191 1.000000 -0.971601 0.989751 0.809083 0.948662 0.993012 0.799875 -0.027542 0.074458 0.212428 0.005628 0.118817
elongatedness -0.788750 -0.821472 -0.911307 -0.789481 -0.183035 -0.180140 -0.971601 1.000000 -0.948996 -0.775854 -0.936382 -0.953816 -0.766314 0.103302 -0.052600 -0.185053 -0.115126 -0.216905
pr.axis_rectangularity 0.813694 0.843400 0.893025 0.708385 0.079604 0.161502 0.989751 -0.948996 1.000000 0.810934 0.934227 0.988213 0.796690 -0.015495 0.083767 0.214700 -0.018649 0.099286
max.length_rectangularity 0.676143 0.961318 0.774527 0.568949 0.126909 0.305943 0.809083 -0.775854 0.810934 1.000000 0.744985 0.794615 0.866450 0.041622 0.135852 0.001366 -0.103948 0.076770
scaled_variance 0.762070 0.796306 0.861519 0.793415 0.272910 0.318957 0.948662 -0.936382 0.934227 0.744985 1.000000 0.945678 0.778917 0.113078 0.036729 0.194239 0.014219 0.085695
scaled_variance.1 0.814012 0.835946 0.886017 0.718436 0.089189 0.143253 0.993012 -0.953816 0.988213 0.794615 0.945678 1.000000 0.795017 -0.015401 0.076877 0.200811 0.006219 0.102935
scaled_radius_of_gyration 0.585243 0.925816 0.705771 0.536372 0.121971 0.189743 0.799875 -0.766314 0.796690 0.866450 0.778917 0.795017 1.000000 0.191473 0.166483 -0.056153 -0.224450 -0.118002
scaled_radius_of_gyration.1 -0.249593 0.051946 -0.225944 -0.180397 0.152950 0.295735 -0.027542 0.103302 -0.015495 0.041622 0.113078 -0.015401 0.191473 1.000000 -0.088355 -0.126183 -0.748865 -0.802123
skewness_about 0.236078 0.144198 0.113924 0.048713 -0.058371 0.015599 0.074458 -0.052600 0.083767 0.135852 0.036729 0.076877 0.166483 -0.088355 1.000000 -0.034990 0.115297 0.097126
skewness_about.1 0.157015 -0.011439 0.265547 0.173741 -0.031976 0.043422 0.212428 -0.185053 0.214700 0.001366 0.194239 0.200811 -0.056153 -0.126183 -0.034990 1.000000 0.077310 0.204990
skewness_about.2 0.298537 -0.104426 0.146098 0.382214 0.239886 -0.026081 0.005628 -0.115126 -0.018649 -0.103948 0.014219 0.006219 -0.224450 -0.748865 0.115297 0.077310 1.000000 0.892581
hollows_ratio 0.365552 0.046351 0.332732 0.471309 0.267725 0.143919 0.118817 -0.216905 0.099286 0.076770 0.085695 0.102935 -0.118002 -0.802123 0.097126 0.204990 0.892581 1.000000
In [21]:
#Heatmap of the correlation between the independent attributes

plt.figure(figsize=(30,15))
sns.heatmap(df1.corr(), vmax=1, square=True,annot=True,cmap='viridis')
plt.title('Correlation between different attributes')
plt.show()
  - pr.axis_rectangularity and scaled_variance.1 are very highly correlated, with a value of 0.99
  - scatter_ratio and pr.axis_rectangularity are very highly correlated, with a value of 0.99
  - max.length_rectangularity and circularity are also very highly correlated, with a value of 0.96

     among other features.

     There are also features that are weakly or even negatively correlated, such as:

   - skewness_about.2 and circularity, with a value of -0.1
   - scaled_radius_of_gyration.1 and radius_ratio, with a value of -0.18

   Other relationships can be seen clearly from the heatmap.
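Such highly correlated pairs can also be pulled out programmatically by scanning the upper triangle of the correlation matrix. A sketch on synthetic data (column names `f1`-`f3` are made up for illustration; with the real data the pairs reported above would appear):

```python
import numpy as np
import pandas as pd

# Synthetic frame: f2 is a near-duplicate of f1, f3 is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
frame = pd.DataFrame({
    "f1": base,
    "f2": base + rng.normal(scale=0.05, size=200),
    "f3": rng.normal(size=200),
})

# Collect feature pairs with |correlation| above a threshold.
corr = frame.corr().abs()
pairs = [
    (corr.index[i], corr.columns[j], corr.iloc[i, j])
    for i in range(len(corr)) for j in range(i + 1, len(corr))
    if corr.iloc[i, j] > 0.9
]
```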
In [22]:
#Pairplot of the correlation/distribution between various independent attributes
sns.pairplot(df1, diag_kind="kde")
Out[22]:
<seaborn.axisgrid.PairGrid at 0x1a17212da0>

The pairplot above validates the insights from the heatmap. scaled_variance and scaled_variance.1 have a very strong positive correlation (0.95), and skewness_about.2 and hollows_ratio also correlate strongly positively (0.89).

scatter_ratio and elongatedness have a very strong negative correlation (-0.97), and elongatedness and pr.axis_rectangularity are also strongly negatively correlated (-0.95).

Since the pairplot confirms that pairs such as scaled_variance/scaled_variance.1 and elongatedness/pr.axis_rectangularity are strongly correlated, they need to be dropped or treated carefully before we move on to model building.



Choosing the right attributes for model building

With our objective of classifying a silhouette as a van, bus or car based on some input features, we would ideally want little or no multicollinearity between the features. If the data contains features that are highly correlated, we run into what is known as “multicollinearity”.

Multicollinearity can lead to misleading results. It arises when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy.

It is quite reasonable to drop one of two features that are highly correlated, since there is little point in using both. From the heatmap as well as the pairplot, we recognised that there are many features that are highly correlated, positively or negatively, with values as high as 0.99 and -0.97.
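Multicollinearity can also be quantified with the variance inflation factor (VIF): VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the remaining features; values far above roughly 5-10 flag collinear columns. A sketch on synthetic data (the `vif` helper and the data are illustrative, not part of the notebook's pipeline):

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column, via least squares."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - (y - A @ coef).var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

# Column 2 is almost a copy of column 0, so both get a huge VIF;
# column 1 is independent and stays near 1.
rng = np.random.default_rng(1)
a, b = rng.normal(size=300), rng.normal(size=300)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=300)])
```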

These features are listed below:

    max.length_rectangularity
    scaled_radius_of_gyration
    skewness_about.2
    scatter_ratio
    elongatedness
    pr.axis_rectangularity
    scaled_variance
    scaled_variance.1

As pointed out earlier, the easiest way to deal with multicollinearity is to delete one of the highly correlated features. However, we will use a better approach, dimensionality reduction, specifically Principal Component Analysis (PCA).

Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables. These combinations are done in such a way that the new variables (principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components.
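What PCA does can be sketched directly with NumPy: eigendecompose the covariance matrix of centred data, then project onto the eigenvectors, whose eigenvalues give each component's explained variance. The 2-D data below is synthetic, purely for illustration:

```python
import numpy as np

# Synthetic correlated 2-D data.
rng = np.random.default_rng(2)
data = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
order = np.argsort(eigvals)[::-1]         # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = centered @ eigvecs               # the principal-component scores
ratio = eigvals / eigvals.sum()           # explained-variance ratio
```

The scores are uncorrelated by construction, which is exactly the property we exploit below to sidestep multicollinearity.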

Performing PCA

In [91]:
#printing the shape of the dependent and independent attributes
print("shape of Independent attributes:",df1.shape)
print("shape of Dependent attributes:",y.shape)
shape of Independent attributes: (846, 18)
shape of Dependent attributes: (846,)
In [92]:
from scipy.stats import zscore
XScaled=df1.apply(zscore)
XScaled.head()
Out[92]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
0 0.160580 0.518073 0.057177 0.273363 1.310398 0.311542 -0.207598 0.136262 -0.224342 0.758332 -0.401920 -0.341934 0.285705 -0.327326 -0.073812 0.380870 -0.312012 0.183957
1 -0.325470 -0.623732 0.120741 -0.835032 -0.593753 0.094079 -0.599423 0.520519 -0.610886 -0.344578 -0.593357 -0.619724 -0.513630 -0.059384 0.538390 0.156798 0.013265 0.452977
2 1.254193 0.844303 1.519141 1.202018 0.548738 0.311542 1.148719 -1.144597 0.935290 0.689401 1.097671 1.109379 1.392477 0.074587 1.558727 -0.403383 -0.149374 0.049447
3 -0.082445 -0.623732 -0.006386 -0.295813 0.167907 0.094079 -0.750125 0.648605 -0.610886 -0.344578 -0.912419 -0.738777 -1.466683 -1.265121 -0.073812 -0.291347 1.639649 1.529056
4 -1.054545 -0.134387 -0.769150 1.082192 5.245643 9.444962 -0.599423 0.520519 -0.610886 -0.275646 1.671982 -0.648070 0.408680 7.309005 0.538390 -0.179311 -1.450481 -1.699181
In [93]:
#Alternatively

# from sklearn.preprocessing import StandardScaler
# #We standardize the entire X (independent variable data) using StandardScaler. We will create the PCA dimensions
# # on this distribution. 
# sdsc = StandardScaler()
# X_std =  sdsc.fit_transform(df1) 
In [94]:
#Getting the covariance matrix

covMatrix = np.cov(XScaled,rowvar=False)
print(covMatrix)
[[ 1.00118343  0.68569786  0.79086299  0.69055952  0.09164265  0.14842463
   0.81358214 -0.78968322  0.81465658  0.67694334  0.76297234  0.81497566
   0.58593517 -0.24988794  0.23635777  0.15720044  0.29889034  0.36598446]
 [ 0.68569786  1.00118343  0.79325751  0.6216467   0.15396023  0.25176438
   0.8489411  -0.82244387  0.84439802  0.96245572  0.79724837  0.83693508
   0.92691166  0.05200785  0.14436828 -0.01145212 -0.10455005  0.04640562]
 [ 0.79086299  0.79325751  1.00118343  0.76794246  0.15864319  0.26499957
   0.90614687 -0.9123854   0.89408198  0.77544391  0.86253904  0.88706577
   0.70660663 -0.22621115  0.1140589   0.26586088  0.14627113  0.33312625]
 [ 0.69055952  0.6216467   0.76794246  1.00118343  0.66423242  0.45058426
   0.73529816 -0.79041561  0.70922371  0.56962256  0.79435372  0.71928618
   0.53700678 -0.18061084  0.04877032  0.17394649  0.38266622  0.47186659]
 [ 0.09164265  0.15396023  0.15864319  0.66423242  1.00118343  0.64949139
   0.10385472 -0.18325156  0.07969786  0.1270594   0.27323306  0.08929427
   0.12211524  0.15313091 -0.05843967 -0.0320139   0.24016968  0.26804208]
 [ 0.14842463  0.25176438  0.26499957  0.45058426  0.64949139  1.00118343
   0.16638787 -0.18035326  0.16169312  0.30630475  0.31933428  0.1434227
   0.18996732  0.29608463  0.01561769  0.04347324 -0.02611148  0.14408905]
 [ 0.81358214  0.8489411   0.90614687  0.73529816  0.10385472  0.16638787
   1.00118343 -0.97275069  0.99092181  0.81004084  0.94978498  0.9941867
   0.80082111 -0.02757446  0.07454578  0.21267959  0.00563439  0.1189581 ]
 [-0.78968322 -0.82244387 -0.9123854  -0.79041561 -0.18325156 -0.18035326
  -0.97275069  1.00118343 -0.95011894 -0.77677186 -0.93748998 -0.95494487
  -0.76722075  0.10342428 -0.05266193 -0.18527244 -0.11526213 -0.2171615 ]
 [ 0.81465658  0.84439802  0.89408198  0.70922371  0.07969786  0.16169312
   0.99092181 -0.95011894  1.00118343  0.81189327  0.93533261  0.98938264
   0.79763248 -0.01551372  0.08386628  0.21495454 -0.01867064  0.09940372]
 [ 0.67694334  0.96245572  0.77544391  0.56962256  0.1270594   0.30630475
   0.81004084 -0.77677186  0.81189327  1.00118343  0.74586628  0.79555492
   0.86747579  0.04167099  0.13601231  0.00136727 -0.10407076  0.07686047]
 [ 0.76297234  0.79724837  0.86253904  0.79435372  0.27323306  0.31933428
   0.94978498 -0.93748998  0.93533261  0.74586628  1.00118343  0.94679667
   0.77983844  0.11321163  0.03677248  0.19446837  0.01423606  0.08579656]
 [ 0.81497566  0.83693508  0.88706577  0.71928618  0.08929427  0.1434227
   0.9941867  -0.95494487  0.98938264  0.79555492  0.94679667  1.00118343
   0.79595778 -0.01541878  0.07696823  0.20104818  0.00622636  0.10305714]
 [ 0.58593517  0.92691166  0.70660663  0.53700678  0.12211524  0.18996732
   0.80082111 -0.76722075  0.79763248  0.86747579  0.77983844  0.79595778
   1.00118343  0.19169941  0.16667971 -0.05621953 -0.22471583 -0.11814142]
 [-0.24988794  0.05200785 -0.22621115 -0.18061084  0.15313091  0.29608463
  -0.02757446  0.10342428 -0.01551372  0.04167099  0.11321163 -0.01541878
   0.19169941  1.00118343 -0.08846001 -0.12633227 -0.749751   -0.80307227]
 [ 0.23635777  0.14436828  0.1140589   0.04877032 -0.05843967  0.01561769
   0.07454578 -0.05266193  0.08386628  0.13601231  0.03677248  0.07696823
   0.16667971 -0.08846001  1.00118343 -0.03503155  0.1154338   0.09724079]
 [ 0.15720044 -0.01145212  0.26586088  0.17394649 -0.0320139   0.04347324
   0.21267959 -0.18527244  0.21495454  0.00136727  0.19446837  0.20104818
  -0.05621953 -0.12633227 -0.03503155  1.00118343  0.07740174  0.20523257]
 [ 0.29889034 -0.10455005  0.14627113  0.38266622  0.24016968 -0.02611148
   0.00563439 -0.11526213 -0.01867064 -0.10407076  0.01423606  0.00622636
  -0.22471583 -0.749751    0.1154338   0.07740174  1.00118343  0.89363767]
 [ 0.36598446  0.04640562  0.33312625  0.47186659  0.26804208  0.14408905
   0.1189581  -0.2171615   0.09940372  0.07686047  0.08579656  0.10305714
  -0.11814142 -0.80307227  0.09724079  0.20523257  0.89363767  1.00118343]]
In [95]:
covMatrix.shape #shape of the covariance matrix
Out[95]:
(18, 18)
In [96]:
#Performing PCA on all the 18 components

from sklearn.decomposition import PCA
pca = PCA(n_components=18)
pca.fit(XScaled)
Out[96]:
PCA(copy=True, iterated_power='auto', n_components=18, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)


The eigenvalues

In [97]:
print(pca.explained_variance_) 
[9.40460261e+00 3.01492206e+00 1.90352502e+00 1.17993747e+00
 9.17260633e-01 5.39992629e-01 3.58870118e-01 2.21932456e-01
 1.60608597e-01 9.18572234e-02 6.64994118e-02 4.66005994e-02
 3.57947189e-02 2.74120657e-02 2.05792871e-02 1.79166314e-02
 1.00257898e-02 2.96445743e-03]

The eigenvectors

In [98]:
print(pca.components_)
[[ 2.75283688e-01  2.93258469e-01  3.04609128e-01  2.67606877e-01
   8.05039890e-02  9.72756855e-02  3.17092750e-01 -3.14133155e-01
   3.13959064e-01  2.82830900e-01  3.09280359e-01  3.13788457e-01
   2.72047492e-01 -2.08137692e-02  4.14555082e-02  5.82250207e-02
   3.02795063e-02  7.41453913e-02]
 [-1.26953763e-01  1.25576727e-01 -7.29516436e-02 -1.89634378e-01
  -1.22174860e-01  1.07482875e-02  4.81181371e-02  1.27498515e-02
   5.99352482e-02  1.16220532e-01  6.22806229e-02  5.37843596e-02
   2.09233172e-01  4.88525148e-01 -5.50899716e-02 -1.24085090e-01
  -5.40914775e-01 -5.40354258e-01]
 [-1.19922479e-01 -2.48205467e-02 -5.60143254e-02  2.75074211e-01
   6.42012966e-01  5.91801304e-01 -9.76283108e-02  5.76484384e-02
  -1.09512416e-01 -1.70641987e-02  5.63239801e-02 -1.08840729e-01
  -3.14636493e-02  2.86277015e-01 -1.15679354e-01 -7.52828901e-02
   8.73592034e-03  3.95242743e-02]
 [ 7.83843562e-02  1.87337408e-01 -7.12008427e-02 -4.26053415e-02
   3.27257119e-02  3.14147277e-02 -9.57485748e-02  8.22901952e-02
  -9.24582989e-02  1.88005612e-01 -1.19844008e-01 -9.17449325e-02
   2.00095228e-01 -6.55051354e-02  6.04794251e-01 -6.66114117e-01
   1.05526253e-01  4.74890311e-02]
 [ 6.95178336e-02 -8.50649539e-02  4.06645651e-02 -4.61473714e-02
  -4.05494487e-02  2.13432566e-01 -1.54853055e-02  7.68518712e-02
   2.17633157e-03 -6.06366845e-02 -4.56472367e-04 -1.95548315e-02
  -6.15991681e-02  1.45530146e-01  7.29189842e-01  5.99196401e-01
  -1.00602332e-01 -2.98614819e-02]
 [ 1.44875476e-01 -3.02731148e-01 -1.38405773e-01  2.48136636e-01
   2.36932611e-01 -4.19330747e-01  1.16100153e-01 -1.41840112e-01
   9.80561329e-02 -4.61674972e-01  2.36225434e-01  1.57820194e-01
  -1.35576278e-01  2.41356821e-01  2.03209257e-01 -1.91960802e-01
   1.56939174e-01 -2.41222817e-01]
 [ 4.51862331e-01 -2.49103387e-01  7.40350569e-02 -1.76912814e-01
  -3.97876601e-01  5.03413610e-01  6.49879382e-02  1.38112945e-02
   9.66573058e-02 -1.04552173e-01  1.14622578e-01  8.37350220e-02
  -3.73944382e-01  1.11952983e-01 -8.06328902e-02 -2.84558723e-01
   1.81451818e-02  1.57237839e-02]
 [-5.66136785e-01 -1.79851809e-01  4.34748988e-01  1.01998360e-01
  -6.87147927e-02  1.61153097e-01  1.00688056e-01 -2.15497166e-01
   6.35933915e-02 -2.49495867e-01  5.02096319e-02  4.37649907e-02
  -1.08474496e-01 -3.40878491e-01  1.56487670e-01 -2.08774083e-01
  -3.04580219e-01 -3.04186304e-02]
 [-4.84418105e-01 -1.41569001e-02 -1.67572478e-01 -2.30313563e-01
  -2.77128307e-01  1.48032250e-01  5.44574214e-02 -1.56867362e-01
   5.24978759e-03 -6.10362445e-02  2.97588112e-01  8.33669838e-02
   2.41655483e-01  3.20221887e-01  2.21054148e-02  1.01761758e-02
   5.17222779e-01  1.71506343e-01]
 [-2.60076393e-01  9.80779086e-02 -2.05031597e-01 -4.77888949e-02
   1.08075009e-01 -1.18266345e-01  1.65167200e-01 -1.51612333e-01
   1.93777917e-01  4.69059999e-01 -1.29986011e-01  1.58203940e-01
  -6.86493700e-01  1.27648385e-01  9.83643219e-02 -3.55150608e-02
   1.93956186e-02  6.41314778e-02]
 [ 4.65342885e-02  3.01323693e-03  7.06489498e-01 -1.07151583e-01
   3.85169721e-02 -2.62254132e-01 -1.70405800e-01 -5.76632611e-02
  -2.72514033e-01  1.41434233e-01  7.72596638e-02 -2.43226301e-01
  -1.58888394e-01  4.19188664e-01 -1.25447648e-02 -3.27808069e-02
   1.20597635e-01  9.19597847e-02]
 [ 1.20344026e-02 -2.13635088e-01  3.46330345e-04 -1.57049977e-01
   1.10106595e-01 -1.32935328e-01  9.55883216e-02  1.22012715e-01
   2.51281206e-01 -1.24529334e-01 -2.15011644e-01  1.75685051e-01
   1.90336498e-01  2.85710601e-01 -1.60327156e-03 -8.32589542e-02
  -3.53723696e-01  6.85618161e-01]
 [ 1.56136836e-01  1.50116709e-02 -2.37111452e-01 -3.07818692e-02
  -3.92804479e-02  3.72884301e-02  3.94638419e-02 -8.10394855e-01
  -2.71573184e-01 -7.57105808e-02 -1.53180808e-01 -3.07948154e-01
   3.76087492e-02  4.34650674e-02  9.94304634e-03  2.68915150e-02
  -1.86595152e-01  1.42380007e-01]
 [-6.00485194e-02  4.26993118e-01 -1.46240270e-01  5.21374718e-01
  -3.63120360e-01 -6.27796802e-02 -6.40502241e-02  1.86946145e-01
  -1.80912790e-01 -1.74070296e-01  2.77272123e-01 -7.85141734e-02
  -2.00683948e-01  1.46861607e-01  1.73360301e-02 -3.13689218e-02
  -2.31451048e-01  2.88502234e-01]
 [-9.67780251e-03 -5.97862837e-01 -1.57257142e-01  1.66551725e-01
  -6.36138719e-02 -8.63169844e-02 -7.98693109e-02  4.21515054e-02
  -1.44490635e-01  5.11259153e-01  4.53236855e-01 -1.26992250e-01
   1.09982525e-01 -1.11271959e-01  2.40943096e-02 -9.89651885e-03
  -1.82212045e-01  9.04014702e-02]
 [-6.50956666e-02 -2.61244802e-01  7.82651714e-02  5.60792139e-01
  -3.22276873e-01  4.87809642e-02  1.81839668e-02 -2.50330194e-02
   1.64490784e-01  1.47280090e-01 -5.64444637e-01 -6.85856929e-02
   1.47099233e-01  2.32941262e-01 -2.77589170e-02  2.78187408e-03
   1.90629960e-01 -1.20966490e-01]
 [ 6.00532537e-03 -7.38059396e-02  2.50791236e-02  3.59880417e-02
  -1.25847434e-02  2.84168792e-02  2.49652703e-01  4.21478467e-02
  -7.17396292e-01  4.70233017e-02 -1.71503771e-01  6.16589383e-01
   2.64910290e-02  1.42959461e-02 -1.74310271e-03  7.08894692e-03
  -7.67874680e-03 -6.37681817e-03]
 [-1.00728764e-02 -9.15939674e-03  6.94599696e-03 -4.20156482e-02
   3.12698087e-02 -9.99915816e-03  8.40975659e-01  2.38188639e-01
  -1.01154594e-01 -1.69481636e-02  6.04665108e-03 -4.69202757e-01
   1.17483082e-02  3.14812146e-03 -3.03156233e-03 -1.25315953e-02
   4.34282436e-02 -6.47700819e-03]]
In [99]:
print(pca.explained_variance_ratio_) #the fraction of the total variance explained by each principal component; visualised in the next cell
[5.21860337e-01 1.67297684e-01 1.05626388e-01 6.54745969e-02
 5.08986889e-02 2.99641300e-02 1.99136623e-02 1.23150069e-02
 8.91215289e-03 5.09714695e-03 3.69004485e-03 2.58586200e-03
 1.98624491e-03 1.52109243e-03 1.14194232e-03 9.94191854e-04
 5.56329946e-04 1.64497408e-04]
In [100]:
#visualisation of the explained variance ratio of each principal component

plt.bar(list(range(1,19)),pca.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('eigen Value')
plt.show()
In [101]:
#Cumulative explained variance of the principal components (elbow plot)

plt.step(list(range(1,19)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cum of variation explained')
plt.xlabel('eigen Value')
plt.show()


Dimensionality Reduction

Now 8 dimensions seem very reasonable: with 8 principal components we can explain over 95% of the variation in the original data!
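The cut-off can also be chosen programmatically: passing a float to `n_components` makes scikit-learn's PCA keep the fewest components whose cumulative explained-variance ratio reaches that fraction. A sketch on synthetic data (three latent factors duplicated into six columns; on `XScaled`, the same call would pick the minimal count meeting the 95% threshold):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 3 independent latent factors, each copied into two
# columns plus tiny noise, so 3 components carry almost all variance.
rng = np.random.default_rng(3)
latent = rng.normal(size=(300, 3))
data = np.hstack([latent, latent]) + 0.01 * rng.normal(size=(300, 6))

# A float n_components keeps the fewest components whose cumulative
# explained-variance ratio reaches that fraction.
pca95 = PCA(n_components=0.95)
reduced = pca95.fit_transform(data)
```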

In [102]:
#Keeping only the first 8 principal components instead of all 18

pca3 = PCA(n_components=8)
pca3.fit(XScaled)
print(pca3.components_)
print(pca3.explained_variance_ratio_)
Xpca3 = pca3.transform(XScaled)
[[ 2.75283688e-01  2.93258469e-01  3.04609128e-01  2.67606877e-01
   8.05039890e-02  9.72756855e-02  3.17092750e-01 -3.14133155e-01
   3.13959064e-01  2.82830900e-01  3.09280359e-01  3.13788457e-01
   2.72047492e-01 -2.08137692e-02  4.14555082e-02  5.82250207e-02
   3.02795063e-02  7.41453913e-02]
 [-1.26953763e-01  1.25576727e-01 -7.29516436e-02 -1.89634378e-01
  -1.22174860e-01  1.07482875e-02  4.81181371e-02  1.27498515e-02
   5.99352482e-02  1.16220532e-01  6.22806229e-02  5.37843596e-02
   2.09233172e-01  4.88525148e-01 -5.50899716e-02 -1.24085090e-01
  -5.40914775e-01 -5.40354258e-01]
 [-1.19922479e-01 -2.48205467e-02 -5.60143254e-02  2.75074211e-01
   6.42012966e-01  5.91801304e-01 -9.76283108e-02  5.76484384e-02
  -1.09512416e-01 -1.70641987e-02  5.63239801e-02 -1.08840729e-01
  -3.14636493e-02  2.86277015e-01 -1.15679354e-01 -7.52828901e-02
   8.73592034e-03  3.95242743e-02]
 [ 7.83843562e-02  1.87337408e-01 -7.12008427e-02 -4.26053415e-02
   3.27257119e-02  3.14147277e-02 -9.57485748e-02  8.22901952e-02
  -9.24582989e-02  1.88005612e-01 -1.19844008e-01 -9.17449325e-02
   2.00095228e-01 -6.55051354e-02  6.04794251e-01 -6.66114117e-01
   1.05526253e-01  4.74890311e-02]
 [ 6.95178336e-02 -8.50649539e-02  4.06645651e-02 -4.61473714e-02
  -4.05494487e-02  2.13432566e-01 -1.54853055e-02  7.68518712e-02
   2.17633157e-03 -6.06366845e-02 -4.56472367e-04 -1.95548315e-02
  -6.15991681e-02  1.45530146e-01  7.29189842e-01  5.99196401e-01
  -1.00602332e-01 -2.98614819e-02]
 [ 1.44875476e-01 -3.02731148e-01 -1.38405773e-01  2.48136636e-01
   2.36932611e-01 -4.19330747e-01  1.16100153e-01 -1.41840112e-01
   9.80561329e-02 -4.61674972e-01  2.36225434e-01  1.57820194e-01
  -1.35576278e-01  2.41356821e-01  2.03209257e-01 -1.91960802e-01
   1.56939174e-01 -2.41222817e-01]
 [ 4.51862331e-01 -2.49103387e-01  7.40350569e-02 -1.76912814e-01
  -3.97876601e-01  5.03413610e-01  6.49879382e-02  1.38112945e-02
   9.66573058e-02 -1.04552173e-01  1.14622578e-01  8.37350220e-02
  -3.73944382e-01  1.11952983e-01 -8.06328902e-02 -2.84558723e-01
   1.81451818e-02  1.57237839e-02]
 [-5.66136785e-01 -1.79851809e-01  4.34748988e-01  1.01998360e-01
  -6.87147927e-02  1.61153097e-01  1.00688056e-01 -2.15497166e-01
   6.35933915e-02 -2.49495867e-01  5.02096319e-02  4.37649907e-02
  -1.08474496e-01 -3.40878491e-01  1.56487670e-01 -2.08774083e-01
  -3.04580219e-01 -3.04186304e-02]]
[0.52186034 0.16729768 0.10562639 0.0654746  0.05089869 0.02996413
 0.01991366 0.01231501]
In [103]:
Xpca3 #now we have only 8 variables instead of 18
Out[103]:
array([[ 3.34162030e-01, -2.19026358e-01,  1.00158417e+00, ...,
        -7.57446693e-01, -9.01124283e-01, -3.81106357e-01],
       [-1.59171085e+00, -4.20602982e-01, -3.69033854e-01, ...,
        -5.17161832e-01,  3.78636988e-01,  2.47058909e-01],
       [ 3.76932418e+00,  1.95282752e-01,  8.78587404e-02, ...,
         7.05041037e-01, -3.45837595e-02,  4.82771767e-01],
       ...,
       [ 4.80917387e+00, -1.24931049e-03,  5.32333105e-01, ...,
        -2.17069763e-01,  5.73248962e-01,  1.10477865e-01],
       [-3.29409242e+00, -1.00827615e+00, -3.57003198e-01, ...,
        -4.02491279e-01, -2.02405787e-01,  3.20621635e-01],
       [-4.76505347e+00,  3.34899728e-01, -5.68136078e-01, ...,
        -3.35637136e-01,  5.80978683e-02, -2.48034955e-01]])
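One way to quantify the information lost by keeping 8 of 18 components is to map the scores back into the original space with inverse_transform and measure the reconstruction error. A sketch on synthetic data (the notebook's XScaled and pca3 are assumed above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 18))  # stand-in for a scaled 18-feature matrix

pca = PCA(n_components=8).fit(X)
X_reduced = pca.transform(X)             # 200 x 8 scores
X_back = pca.inverse_transform(X_reduced)  # back to 200 x 18, with some loss

mse = np.mean((X - X_back) ** 2)  # average squared reconstruction error
print(X_reduced.shape, X_back.shape, mse)
```

The smaller this error, the less information the dropped components carried.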
In [104]:
#pairplot of the new variables showing no correlation

sns.pairplot(pd.DataFrame(Xpca3))
Out[104]:
<seaborn.axisgrid.PairGrid at 0x1a2e71dfd0>
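The pairplot suggests the principal components are uncorrelated; this can also be checked numerically, since the correlation matrix of the PCA scores should be (near) identity. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 18))

scores = PCA(n_components=8).fit_transform(X)
corr = np.corrcoef(scores, rowvar=False)  # 8 x 8 correlation matrix of the scores

# Off-diagonal correlations between principal components are ~0
off_diag = corr - np.diag(np.diag(corr))
print(np.abs(off_diag).max())
```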

Fit Linear Model

Let's construct two linear models: the first with all 18 original (scaled) variables and the second with only the 8 new variables constructed using PCA.

In [51]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder() 
y = le.fit_transform(y)
y.shape
Out[51]:
(846,)
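Note that LabelEncoder assigns integer codes in sorted (alphabetical) order of the class names, so the code-to-class mapping is worth checking before labeling confusion matrices later on. A small illustration with the three class names from this dataset:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["van", "car", "bus", "car", "van"])

print(le.classes_)                     # class names in sorted order
print(codes)                           # integer code for each input label
print(le.inverse_transform([0, 1, 2])) # map codes back to names
```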
In [52]:
# from sklearn.linear_model import LinearRegression
# regression_model = LinearRegression()
# regression_model.fit(XScaled, y)
# regression_model.score(XScaled, y)
Out[52]:
0.6602486689531759
In [53]:
# regression_model_pca = LinearRegression()
# regression_model_pca.fit(Xpca3, y)
# regression_model_pca.score(Xpca3, y)
Out[53]:
0.44738619649715605


Splitting the dataset into training and testing set

In [54]:
from sklearn.model_selection import train_test_split, KFold

X_train, X_test, y_train, y_test = train_test_split(df1, y, test_size = 0.1942313295, random_state = 14)

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
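The KFold object above can be passed to cross_val_score to get a k-fold estimate of model accuracy rather than a single hold-out score. A sketch on a toy dataset (not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 8))
y = (X[:, 0] > 0).astype(int)  # toy binary target for illustration

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=k_fold)  # one accuracy per fold
print(scores.mean(), scores.std())
```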
In [74]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.svm import SVC

svc = SVC()
#svc.fit(X_train,y_train)
Out[74]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
In [75]:
#Orig_y_predict = svc.predict(X_test) #predict on test data
In [76]:
#svc.score(X_test, y_test) 
Out[76]:
0.5393939393939394
In [77]:
#now split the data into 70:30 ratio

#original Data
Orig_X_train,Orig_X_test,Orig_y_train,Orig_y_test = train_test_split(XScaled,y,test_size=0.30,random_state=1)

#PCA Data
pca_X_train,pca_X_test,pca_y_train,pca_y_test = train_test_split(Xpca3,y,test_size=0.30,random_state=1)
In [78]:
svc.fit(Orig_X_train,Orig_y_train) #SVC on original data
Out[78]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
In [79]:
Orig_y_predict = svc.predict(Orig_X_test) #Prediction on original dataset
In [80]:
#now fit the model on pca data with new dimension
svc1 = SVC() #instantiate the object
svc1.fit(pca_X_train,pca_y_train)

#predict the y value
pca_y_predict = svc1.predict(pca_X_test) #Prediction on pca test dataset
In [81]:
#display accuracy score of both models
from sklearn.metrics import accuracy_score,confusion_matrix, classification_report,roc_auc_score

print("Model Score On Original Data ",svc.score(Orig_X_test, Orig_y_test))
print("Model Score On Reduced PCA Dimension ",svc1.score(pca_X_test, pca_y_test))
print("-------"*10)
print("Before PCA On Original 18 Dimension",accuracy_score(Orig_y_test,Orig_y_predict))
print("After PCA(On 8 dimension)",accuracy_score(pca_y_test,pca_y_predict))
Model Score On Original Data  0.952755905511811
Model Score On Reduced PCA Dimension  0.9330708661417323
----------------------------------------------------------------------
Before PCA On Original 18 Dimension 0.952755905511811
After PCA(On 8 dimension) 0.9330708661417323

Observation:

Our support vector classifier without PCA has an accuracy score of about 95% on the test set.

The SVC model on the PCA components (reduced dimensions) has an accuracy score of about 93%.

So by reducing dimensionality from 18 variables to 8 components, we only lost around 2 percentage points of accuracy on the held-out test data. That small drop seems easy to justify given the dimensionality saving, and the simpler 8-component representation should also be less prone to over-fitting.
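One caveat worth noting: pca3 above was fitted on the full XScaled before the train/test split, which leaks information about the test rows into the components. Wrapping scaling, PCA and the classifier in a Pipeline fits all three on the training split only. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 18))
y = (X[:, :3].sum(axis=1) > 0).astype(int)  # toy target

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=8)),
    ("svc", SVC()),
])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)
pipe.fit(X_tr, y_tr)  # scaler and PCA are fit on the training split only
print(pipe.score(X_te, y_te))
```

The same pattern applied to this notebook's data would give a leakage-free version of the PCA model.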

In [ ]:
 
In [83]:
# Calculate the confusion matrix and plot it to visualise it

def draw_confmatrix(y_test, yhat, str1, str2, str3, datatype):
    #Compute the confusion matrix and plot it as a heatmap
    cm = confusion_matrix(y_test, yhat, labels=[0, 1, 2])
    print("Confusion Matrix For :", "\n", datatype, cm)
    sns.heatmap(cm, annot=True, fmt='d', xticklabels=[str1, str2, str3], yticklabels=[str1, str2, str3])
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()
    

draw_confmatrix(Orig_y_test, Orig_y_predict,"Van ", "Car ", "Bus", "Original Data Set" )

draw_confmatrix(pca_y_test, pca_y_predict,"Van ", "Car ", "Bus", "Reduced Dimensions Using PCA ")

#Classification Report Of Model built on Raw Data
print("Classification Report For Raw Data:", "\n", classification_report(Orig_y_test,Orig_y_predict))

#Classification Report Of Model built on Principal Components:

print("Classification Report For PCA:","\n", classification_report(pca_y_test,pca_y_predict))
Confusion Matrix For : 
 Original Data Set [[ 58   0   1]
 [  1 129   3]
 [  6   1  55]]
Confusion Matrix For : 
 Reduced Dimensions Using PCA  [[ 57   2   0]
 [  2 126   5]
 [  1   7  54]]
Classification Report For Raw Data: 
               precision    recall  f1-score   support

           0       0.89      0.98      0.94        59
           1       0.99      0.97      0.98       133
           2       0.93      0.89      0.91        62

    accuracy                           0.95       254
   macro avg       0.94      0.95      0.94       254
weighted avg       0.95      0.95      0.95       254

Classification Report For PCA: 
               precision    recall  f1-score   support

           0       0.95      0.97      0.96        59
           1       0.93      0.95      0.94       133
           2       0.92      0.87      0.89        62

    accuracy                           0.93       254
   macro avg       0.93      0.93      0.93       254
weighted avg       0.93      0.93      0.93       254
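The per-class scores in these reports follow directly from the confusion matrix: precision is the diagonal entry over its column sum (of all silhouettes predicted as a class, how many were right), and recall is the diagonal entry over its row sum (of all actual members of a class, how many were found). A quick numeric check against the raw-data matrix printed above:

```python
import numpy as np

# Confusion matrix from the raw-data model above (rows = true, cols = predicted)
cm = np.array([[58,   0,  1],
               [ 1, 129,  3],
               [ 6,   1, 55]])

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)  # TP over each predicted-class column
recall = tp / cm.sum(axis=1)     # TP over each true-class row

print(np.round(precision, 2))  # matches the report: [0.89 0.99 0.93]
print(np.round(recall, 2))     # matches the report: [0.98 0.97 0.89]
```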

Observations

Confusion Matrix On Original Data:

  • On the original data set, our model correctly classified 58 of the 59 actual vans, with only 1 wrongly predicted as a bus.

  • It correctly classified 129 cars, wrongly classifying 3 cars as buses and 1 car as a van.

  • Of the 62 actual buses, it correctly classified 55, wrongly classifying 6 buses as vans and 1 bus as a car.

Confusion Matrix On Reduced Dimensions After PCA:

  • Out of 59 actual vans, our model correctly predicted 57 and erred in 2 instances, wrongly classifying those vans as cars.

  • Out of 133 actual cars, our model correctly classified 126 and faltered in 7 cases, wrongly classifying 5 cars as buses and 2 cars as vans.

  • Out of 62 actual buses, our model correctly classified 54 and faltered in 8 cases, wrongly classifying 7 buses as cars and 1 bus as a van.

Insights On Classification Reports:

On original data:

  • Our model has a 99% precision score when classifying a silhouette as a car, 89% precision when classifying it as a van and 93% precision when classifying it as a bus.

  • In terms of recall, our model scores 98% for van, 97% for car and 89% for bus.

  • Our model has a weighted average of 95% across all classification metrics.

On Reduced Dimensions After PCA:

  • Our model has its highest precision score, 95%, when predicting the van type, which is better than the 89% precision for van obtained on the original data set.

  • Recall is almost neck and neck with what our model scored on the original data set, with its highest recall score of 97% for van.


MrBriit